Intro

Welcome to this non-exhaustive list and description of the basics of programming in R, and of some of the most useful commands to work with datasets of the kind you would get in a typical psychology experiment (e.g. responses, reaction times, etc.).

For me it is useful as a place where I can quickly look things up.

For you it might serve the same purpose. If instead you are new to R or to programming in general, this doc (and the links in it) can be a good place to get started.

Be aware that although this doc is primarily about R, you will sometimes find references to and comparisons with the Matlab language (another program I and psychologists/neuroscientists in general like to use).

Good luck!!

Sebastian Korb, Ph.D.
Faculty of Psychology, University of Vienna, Austria
My webpage. For comments please write to sebkorb1[at]gmail.com


Why do R (and not SPSS)?

Free and widely used

R is completely free, and there is a large community using it. The world’s best statisticians and programmers constantly improve old functions and create and distribute new R functions and packages. If you have a question about how to do anything in R, you will most likely find an answer to it through a Google search.

Combine code, analyses, plots, and text

Using RStudio (an editor that interfaces with and helps with R) and its R Markdown format, you can combine R code, analyses, plots, and written text into 1 nice-looking document. This allows you to easily share all your data manipulation and analysis steps, making them transparent, shareable, and reproducible.

Driving a car vs. riding a bus

Wanting to compare R and SPSS, a certain Greg Snow wrote the following on Stack Overflow:

Busses are very easy to use, you just need to know which bus to get on, where to get on, and where to get off (and you need to pay your fare). Cars on the other hand require much more work, you need to have some type of map or directions (even if the map is in your head), you need to put gas in every now and then, you need to know the rules of the road (have some type of drivers licence). The big advantage of the car is that it can take you a bunch of places that the bus does not go and it is quicker for some trips that would require transferring between busses.

Using this analogy programs like SPSS are busses, easy to use for the standard things, but very frustrating if you want to do something that is not already preprogrammed.

R is a 4-wheel drive SUV (though environmentally friendly) with a bike on the back, a kayak on top, good walking and running shoes in the passenger seat, and mountain climbing and spelunking gear in the back.

R can take you anywhere you want to go if you take time to learn how to use the equipment, but that is going to take longer than learning where the bus stops are in SPSS.

Large variety and flexibility of graphs in R:

e.g. show word frequency with a word cloud

see examples in the R-graph gallery,
and description of the wordcloud2 package here

Several word clouds were included in this review article:

To create your own word clouds, you need to install the wordcloud2 package

library(wordcloud2)
head(demoFreq) # some word frequency count
##          word freq
## oil       oil   85
## said     said   73
## prices prices   48
## opec     opec   42
## mln       mln   31
## the       the   26
# Now let's make the corresponding word cloud.  
# Notice how frequent words like "oil" appear larger  
# compared to less frequent words like "the"
wordcloud2(data = demoFreq)  
Make interactive graphs

Use the package plotly to create interactive graphs,
which you can also embed in your website.
This allows you to better explore, understand, and explain your data.

install.packages("plotly")

Let’s make interactive boxplots with the dataset “midwest” from the package “ggplot2”.
It contains demographic information of midwest counties across 5 states (IL, IN, MI, OH, WI).
We’ll focus on the percentage of people with a college education.

library(plotly)
data(midwest)
p <- plot_ly(midwest, x = ~percollege, color = ~state, type = "box")
p

Moving the cursor over the graph allows you to see the values corresponding to the quartiles and the outliers.

For example, you can see that in Wisconsin
Dane county (where the capital Madison lies) has 43.6% of college-educated people.

as.character(midwest[midwest$state == "WI" &  midwest$percollege >= 43, "county"])
## [1] "DANE"

How to set up R:

* You can also use a function from a package without attaching the whole package, by prefixing the function name with the package name and the :: operator. For example, to use the chordDiagram function from the package circlize:

circlize::chordDiagram(matrix(sample(10),  nrow = 4, ncol = 5)) 


Packages

Packages are sets of functions that some smart R-wizard wrote and made available for everybody.

Most packages can be downloaded from the Comprehensive R Archive Network (CRAN). You don’t actually need to go to that website, though — you can install packages from within R or RStudio.

For that, hit the “Packages” tab (probably on the lower right of your RStudio), then “Install”, and enter the name of the desired package. RStudio will download and install it automatically. Alternatively, use the function install.packages("NAME.OF.YOUR.PACKAGE").

Now the package has been installed, and will be included in the list of packages in RStudio. However, you still need to load it every time you want to use it. For that, either click the box next to the package name in the “Packages” tab of RStudio, or use this command: library(NAME.OF.YOUR.PACKAGE) (no more quotation marks needed this time).

If you want to see a list of all the packages that you have currently loaded, type search(), and to see where on your computer these packages are stored type searchpaths().
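For example, right after starting R you might see something like this (the exact list of attached packages depends on your setup):

```r
search()   # list the attached packages and environments
# the global environment and the base package are always attached
".GlobalEnv" %in% search()     # TRUE
"package:base" %in% search()   # TRUE
```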

Some of the packages I use most are: ggplot2 for graphs, lme4 for linear mixed models, but also car and Hmisc.


How to install old packages

If you are not using the latest R version (e.g. because your old OS does not allow it),
it might be necessary to install older versions of certain packages.
THIS link explains how to do it.

require(devtools)
install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org")

Know your version of R

Sometimes a package might not install or run because your R version is too old. To find out which version you are using, type version in the Console and hit Enter.

version
##                _                           
## platform       x86_64-apple-darwin15.6.0   
## arch           x86_64                      
## os             darwin15.6.0                
## system         x86_64, darwin15.6.0        
## status                                     
## major          3                           
## minor          6.1                         
## year           2019                        
## month          07                          
## day            05                          
## svn rev        76782                       
## language       R                           
## version.string R version 3.6.1 (2019-07-05)
## nickname       Action of the Toes

How to master the basics of the R language

How to cite R

R Markdown

I have used R Markdown to create this file.

In a nutshell, R Markdown allows you to write a neat-looking .html or .pdf file containing both text and R code (or code in other languages like Bash or Python).

The code can be executed and you can have the results show up in your .html file.

I highly recommend checking it out and using it for teaching (you can also create presentations - similar to what you would do with PowerPoint) or to create a data analysis report that can be shared with your coworkers.

There are several places on the internet where you might find help to learn R Markdown. Here is one useful resource, and here another one.

Or read the free online version of the book R Markdown: The Definitive Guide, by Xie, Allaire, and Grolemund.

Commands, objects and functions

# for example, if I wanted to put the year 2018 into an object (also called variable) with the name thisyear, I'd write
thisyear <- 2018
# this automatically creates it and one way to check is by typing just the name 
thisyear   # R is case sensitive, so be careful with the big and small caps!
## [1] 2018

Scripts

Everything you can do in the Console you can also do in a Script

Using scripts allows you to save your commands, and reuse them on later occasions.
This also allows you to save time by adapting parts of old scripts to analyze new data.

To open a new script, go to File > New File > R Script, or use the shortcut (on a Mac it is: cmd + Shift + N)

To execute the entire script, hit the RUN button

To execute 1 line, put the cursor somewhere in that line and hit ctrl + Enter (Windows) or cmd + Enter (Mac)

To execute more than 1 line, select them (holding down the left mouse button) and then hit ctrl/cmd + Enter

Comment generously

I strongly advise putting a lot of comments into your R code. This will make it easier to understand for the colleagues with whom you might share it, but also for you when you look at it again after some months or years.

In R, everything appearing after a hashtag (#) is interpreted as a comment, and therefore not executed as code.

In contrast to some other programming languages (e.g. C, with its /* ... */ syntax), you cannot write a single comment that spans several lines. To do that, write # at the beginning of each new line.
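For example:

```r
# a comment that spans several lines
# simply starts each line
# with its own hashtag
x <- 1  # comments can also follow code on the same line
```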

Environment

My list of R commands

Get the working directory

By default, R sets the working directory to the folder from which it was started (often your home folder).

Verify with:

getwd()  # find out which folder on your computer R is looking at
## [1] "/Users/seba/WORK/Teaching/R/R_commands_list_Seba"

Set the working directory

Type setwd("/DISK/FOLDER/") to change the working directory. Or use the mouse to go to Session > Set Working Directory

Create objects

An object is anything that you create and that can contain numbers, text, plots, or statistical models. By convention, objects are created by writing the name of the object followed by <- and then whatever you want to put into it. It also works if you replace the <- with = (as is done in Matlab), but R users will almost certainly look down on you.

# for example, if I wanted to put the year 2017 into an object (also called variable) with the name thisyear, I'd write
thisyear <- 2017
# this automatically creates it and one way to check is by typing just the name 
thisyear   # R is case sensitive, so be careful with the big and small caps!
## [1] 2017

Clear objects

When I start a new R script, I typically first delete everything in the Workspace (in RStudio it is called Environment). For that I use:

rm(list=ls())

If instead you want to delete only a specific object, say the variable a :

a <- 1  # let's create the variable with name "a" and fill it with the integer 1
rm(a)
# or remove several variables at once (however without emptying the Workspace)
rm(a, b, c)

Check what objects are in your Environment

Of course you could do this simply by hitting the Environment tab… but here is how you do it by writing in the Console:

a <- 1  # let's create variable a again
ls()
## [1] "a"

Create a new folder

Myfolder <- "/Users/seba/TESTFOLDER/"
dir.create(Myfolder)

List files in a folder

To list all the files in a specific directory, you can use the function list.files()

list.files(Myfolder)  # if you don't write anything it lists what is in the working directory

You can also look for files with a specific extension

# To only list files with the .html extension
# note: the pattern argument is a regular expression,
# so "\\.html$" matches a literal dot + "html" at the end of the name
Sourcedir = Myfolder
list.files(path = Sourcedir, pattern = "\\.html$")

Or you might want to be even more specific, and indicate an extension and some text in the middle of the name

# To select .html files that also contain "course" in the name
# You create "wildcards" with `.*`, which matches any sequence of characters
Sourcedir = Myfolder
list.files(path = Sourcedir, pattern = "course.*\\.html$")

You can achieve the same result with the function Sys.glob(), which performs wildcard expansion (also known as “globbing”) on file paths

Sys.glob( "*course*.html")

Data classes

There are several data classes in R, and the practical thing is that R mostly figures out automatically what class a variable should have.

  • Numeric Can be integers or decimals
Thisyear <- 2019
class(Thisyear)
## [1] "numeric"
  • Integer Cannot be decimals
ThisInteger <- as.integer(Thisyear)
class(ThisInteger)
## [1] "integer"
  • Character Put characters between single (’’) or double ("") quotes.
Nextyear <- "2020"
class(Nextyear)
## [1] "character"
  • Logical Can only be FALSE (F) or TRUE (T). Also keep in mind that for R, FALSE == 0 and TRUE == 1.
Is_2020_odd <- FALSE
class(Is_2020_odd)
## [1] "logical"
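The fact that TRUE counts as 1 and FALSE as 0 is handy in practice: if a logical vector codes, say, whether each response in an experiment was correct, sum() and mean() directly give the count and proportion of correct responses (correct is a made-up example vector):

```r
# arithmetic works directly on logical vectors
correct <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
sum(correct)    # number of correct responses: 3
mean(correct)   # proportion of correct responses: 0.6
```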

 

Data types

There are several types of data R can handle (e.g. see here for a list).

The data types you will be working with most of the time are these:

  • Vectors contain series of numbers (double), texts (characters or strings), or TRUE/FALSE answers (logical)
# create a numeric vector. use c() to combine values. 
a <- c(0, NA, 3.2, -15, sqrt(4))  # NA means not available; sqrt() computes the square root
# create a character vector
b <- c("Hello", "World", "how", "are", "you")  # the quotes (single or double both work) indicate a text
# create a logical vector
c <- c(TRUE, FALSE, TRUE, FALSE, FALSE)  # must all be caps!
  • Data frames are like Excel’s Worksheets, or SPSS datasets. Each column can contain different things (numeric, character, etc.)
# create a data frame with the vectors a, b,and c that we just created
my_first_dataframe <- data.frame(a,b,c)
# we could also change the column names (currently they are a, b, c)
colnames(my_first_dataframe) <- c("some_numbers", "my_sentence", "logic_test")
# now let's have a look at it
my_first_dataframe
##   some_numbers my_sentence logic_test
## 1          0.0       Hello       TRUE
## 2           NA       World      FALSE
## 3          3.2         how       TRUE
## 4        -15.0         are      FALSE
## 5          2.0         you      FALSE

You might have noticed that I put underscores ( _ ) instead of spaces in the column names. The reason is that this makes working with them easier, trust me.
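To see why, here is a small sketch with a toy data frame: a column name containing a space needs backticks for $-indexing, while a name with an underscore works directly.

```r
df <- data.frame(some_numbers = 1:3)
names(df) <- "some numbers"   # put a space in the column name
df$`some numbers`             # now backticks are required
names(df) <- "some_numbers"   # with an underscore...
df$some_numbers               # ...plain $-indexing just works
```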

  • Lists are ordered collections of things, which can be wildly different (strings, numeric vectors, matrices, scalars, etc.). They are highly flexible
# create a list of 5 elements
my_first_list <- list(aName = "whatever", aNum = 3.7, NumericVector = a, CharacterVector = b, Matrix = my_first_dataframe)
  • Factors store nominal values, which is useful for statistical analyses. More precisely, a factor stores the values as a vector of integers (e.g. with each number corresponding to a unique condition in an experiment), plus an internal vector of character strings (the original values) mapped to these integers
# the 2nd column in the just created data frame "my_first_dataframe" is a factor  
# R has decided so by itself, since that column does not contain numbers or logical data  
# One way to find out is this: 
str(my_first_dataframe)
## 'data.frame':    5 obs. of  3 variables:
##  $ some_numbers: num  0 NA 3.2 -15 2
##  $ my_sentence : Factor w/ 5 levels "are","Hello",..: 2 4 3 1 5
##  $ logic_test  : logi  TRUE FALSE TRUE FALSE FALSE
# compare to the list "my_first_list" 
str(my_first_list)
## List of 5
##  $ aName          : chr "whatever"
##  $ aNum           : num 3.7
##  $ NumericVector  : num [1:5] 0 NA 3.2 -15 2
##  $ CharacterVector: chr [1:5] "Hello" "World" "how" "are" ...
##  $ Matrix         :'data.frame': 5 obs. of  3 variables:
##   ..$ some_numbers: num [1:5] 0 NA 3.2 -15 2
##   ..$ my_sentence : Factor w/ 5 levels "are","Hello",..: 2 4 3 1 5
##   ..$ logic_test  : logi [1:5] TRUE FALSE TRUE FALSE FALSE

Change a factor to numeric or character

It is possible to change a factor into a numeric vector (if it contains numbers), or into a vector filled with characters (if it contains text).

#  we add a new column to my_first_dataframe,
# call it "my_sentence_char", 
# and fill it with a character version of "my sentence" 
# (i.e. we transform from factor to character)
my_first_dataframe$my_sentence_char <- as.character(my_first_dataframe$my_sentence) 
# then we check again
str(my_first_dataframe)
## 'data.frame':    5 obs. of  4 variables:
##  $ some_numbers    : num  0 NA 3.2 -15 2
##  $ my_sentence     : Factor w/ 5 levels "are","Hello",..: 2 4 3 1 5
##  $ logic_test      : logi  TRUE FALSE TRUE FALSE FALSE
##  $ my_sentence_char: chr  "Hello" "World" "how" "are" ...
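Beware of the classic pitfall when going from factor to numeric: calling as.numeric() directly on a factor returns the internal integer codes, not the original numbers. A small sketch:

```r
# a factor that stores numbers as text
f <- factor(c("10", "20", "20", "30"))
as.numeric(f)                  # internal codes: 1 2 2 3  -- NOT what you want
as.numeric(as.character(f))    # original values: 10 20 20 30
```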

 

Read data in R

read.delim

(this is perfect for reading delimited files, defaulting to the TAB character for the delimiter)
but see also read.table(), read.csv(), read.spss()

Here is a useful tutorial by DataCamp on how to import different types of data into R

First you need to tell R where to find the data. You can do this by setting the working directory to the right folder using setwd()
But I actually have the habit of first defining a source directory in my script, and then concatenating its name with the name of the file I want to read in.

SourceDir <- "/Users/seba/TESTFOLDER/"
Mydata <- read.delim(paste(SourceDir,"MyFile.txt", sep=""), header = TRUE) 
# the function *paste* concatenates objects, in this case SourceDir and "MyFile.txt"
# with sep=""  I asked it to do so without adding anything in between

# if your data is a tab-delimited .txt file, you can indicate the separator
Mydata <- read.table(paste(SourceDir,"MyFile.txt", sep=""), header = TRUE, sep="\t") 
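As an aside, base R also offers file.path(), which inserts the path separator for you, and paste0(), a shorthand for paste(..., sep = "") — a sketch using the same hypothetical folder:

```r
SourceDir <- "/Users/seba/TESTFOLDER"   # note: no trailing slash needed
file.path(SourceDir, "MyFile.txt")      # "/Users/seba/TESTFOLDER/MyFile.txt"
paste0(SourceDir, "/MyFile.txt")        # same result with paste0
```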

Built-in data

R comes with some data of its own (in the package datasets), and the same is true for many additional packages. On this web page some of the more common datasets are described. To see all the datasets built into your R, type data(). To see the datasets in a specific package, type data(package = "datasets").

In the package “datasets” you’ll find the data mtcars, which is one of the better known built-in datasets. Load it with data(mtcars)

data(mtcars)  # load into R the built-in dataset called "mtcars"
# this creates the object mtcars in the Environment

To get some information about this dataset type help(mtcars) or ?mtcars

You can also look at the actual data, either by typing mtcars in the Console (this will write all the data), or by double clicking with the left mouse button on mtcars in the Environment tab. This will open a new window, which will look very similar to an Excel Worksheet. The same can be obtained by typing View(mtcars) in the Console.

 

Saving data

You can save as

  • tab-delimited .txt file with write.table(mtcars, "mydata.txt", sep="\t") (this can be opened by every program, and can be read in Excel)

If you do not indicate a path, it will be saved in the working directory (remember that you can use getwd() to know, and setwd() to change the working directory)

I personally prefer defining a “Savedir” folder at the beginning of my script, and then use that when saving

Savedir <- "/Users/seba/my_R"
dir.create(Savedir)   # if you need to create the folder (it will not be overwritten if it exists)
write.table(mtcars, paste(Savedir, "/this_is_my_dataframe.txt", sep=""), sep = "\t")
# this works almost fine. But better to exclude the row names using row.names=FALSE:
write.table(mtcars, paste(Savedir, "/this_is_my_dataframe.txt", sep=""), sep = "\t", 
            row.names = FALSE)

You can also save directly to an .xlsx spreadsheet using write.xlsx()

library(xlsx)
write.xlsx(mtcars, paste(Savedir, "/this_is_my_dataframe.xlsx", sep=""), row.names = FALSE)

Save to SPSS using write.foreign() (but honestly this didn’t work for me, so I recommend importing the .txt file in SPSS instead)

library(foreign)
rownames(mtcars) <- c()
write.foreign(mtcars, paste(Savedir, "/this_is_my_dataframe.txt", sep=""),  
              paste(Savedir, "/this_is_my_dataframe.sps", sep=""),  
              package = "SPSS")

 

Explore data (descriptive statistics)

class(mtcars)    # the mtcars object is a data.frame, something similar to an Excel spreadsheet
## [1] "data.frame"
dim(mtcars)      # get dimensions (rows, columns)
## [1] 32 11
nrow(mtcars)    # get the number of rows (specific for data.frames)
## [1] 32
ncol(mtcars)    # get the number of columns (specific for data.frames)
## [1] 11
names(mtcars)   # get the variable names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
library(car) 
some(mtcars)     # randomly selects 10 rows (part of the car package)
##                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Valiant            18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360         14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 450SL         17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Fiat 128           32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Toyota Corolla     33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Dodge Challenger   15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin        15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Pontiac Firebird   19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Ford Pantera L     15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
head(mtcars)     # shows the first 6 rows 
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mtcars)     # shows the last 6 rows
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
summary(mtcars) # summarizes data per column (min, median, mean, max, etc.)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
library(Hmisc)   # load the Hmisc package  (but the describe function also exists in the psych package)
describe(mtcars) 
##      vars  n   mean     sd median trimmed    mad   min    max  range  skew
## mpg     1 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50  0.61
## cyl     2 32   6.19   1.79   6.00    6.23   2.97  4.00   8.00   4.00 -0.17
## disp    3 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90  0.38
## hp      4 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00  0.73
## drat    5 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17  0.27
## wt      6 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91  0.42
## qsec    7 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40  0.37
## vs      8 32   0.44   0.50   0.00    0.42   0.00  0.00   1.00   1.00  0.24
## am      9 32   0.41   0.50   0.00    0.38   0.00  0.00   1.00   1.00  0.36
## gear   10 32   3.69   0.74   4.00    3.62   1.48  3.00   5.00   2.00  0.53
## carb   11 32   2.81   1.62   2.00    2.65   1.48  1.00   8.00   7.00  1.05
##      kurtosis    se
## mpg     -0.37  1.07
## cyl     -1.76  0.32
## disp    -1.21 21.91
## hp      -0.14 12.12
## drat    -0.71  0.09
## wt      -0.02  0.17
## qsec     0.34  0.32
## vs      -2.00  0.09
## am      -1.92  0.09
## gear    -1.07  0.13
## carb     1.26  0.29
str(mtcars)    # this is what you would get by clicking the >Arrow button next to the mtcars object in the Environment
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

 

Checking normality

Check the distribution of your data, and verify whether your variables follow a normal distribution

A common assumption necessary for many statistical analyses is that the data is approximately normally distributed.
There are several ways of testing this.

For visual inspection, use a histogram and the qqPlot (quantile-quantile plot).

For the qqplot we sort our data, plot it against a specially-designed x-axis based on our reference distribution (e.g., the Gaussian “bell curve”), and look to see whether the points lie approximately on a straight line.

If the data is normally distributed, the points will align on the diagonal of the qqplot

The variable “weight” in the “ChickWeight” dataset clearly does NOT follow a normal distribution

# load the "car" package to use the function qqPlot()
library(car)
par(mfrow = c(1,2))
hist(ChickWeight$weight)
qqPlot(ChickWeight$weight)

## [1] 400 399

For a numerical test of normality, use the Shapiro-Wilk test. If the test is significant (p<.05), then the data is NOT normally distributed.

But keep in mind that tiny deviations from normality can lead to a significant Shapiro-Wilk test if the sample size is large (or the dataset contains many trials). Also, the sample size must be between 3 and 5000; otherwise the test won’t run!

Thus do not blindly rely on this test, and always look at the histogram and qqplot too.
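Both caveats can be illustrated with simulated data (a sketch using rnorm()):

```r
set.seed(1)
# more than 5000 values: shapiro.test() refuses to run
try(shapiro.test(rnorm(6000)))  # error: sample size must be between 3 and 5000
# 5000 values are fine
shapiro.test(rnorm(5000))
```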

shapiro.test(ChickWeight$weight)
## 
##  Shapiro-Wilk normality test
## 
## data:  ChickWeight$weight
## W = 0.90866, p-value < 2.2e-16

Skewness, kurtosis, and the Shapiro-Wilk test are also obtained if using the stat.desc() function from the pastecs package.

Set norm = TRUE to include the Shapiro-Wilk test

library(pastecs)
stat.desc(ChickWeight$weight, basic = FALSE, norm = TRUE)
##       median         mean      SE.mean CI.mean.0.95          var 
## 1.030000e+02 1.218183e+02 2.956204e+00 5.806232e+00 5.051223e+03 
##      std.dev     coef.var     skewness     skew.2SE     kurtosis 
## 7.107196e+01 5.834258e-01 9.587845e-01 4.717394e+00 3.372147e-01 
##     kurt.2SE   normtest.W   normtest.p 
## 8.309955e-01 9.086615e-01 3.873675e-18

Checking homogeneity of variance

Another assumption for parametric statistics is the homogeneity of variances (Data from multiple groups have the same variance).

It can be tested with Levene’s test using the function leveneTest() from the car package (centering on the median by default).

If the Levene test is significant, it means that the assumption of homogeneity of variances has been violated.

library(car)
leveneTest(ChickWeight$weight, ChickWeight$Diet, center = median)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   3  9.6001 3.418e-06 ***
##       574                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Indexing elements in a vector or dataframe

A straightforward way to index elements in a vector or matrix (data frame) is by giving the element number inside square brackets. If you want only 1 element of a vector, then 1 number in brackets is enough.

# For example, let's create a vector called "whatever" with 6 numbers:
whatever <- c(1, 2, 3, 4, 5, 6) 
whatever
## [1] 1 2 3 4 5 6
# btw, you get the same by writing it like this
whatever <- c(1:6)   # the colon tells R to include the numbers 1 and 6, and create all the numbers increasing by 1 in between
whatever   # let's check again
## [1] 1 2 3 4 5 6

Btw, in Matlab (another program that psychologists and neuroscientists like to use) you could easily create a vector in which numbers increase by 2 by typing whatever_2 = 1:2:6, but unfortunately this does not work in R (you can do it with the function seq() though)
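With seq(), the Matlab-style step of 2 looks like this:

```r
seq(from = 1, to = 6, by = 2)          # like Matlab's 1:2:6
## [1] 1 3 5
seq(from = 0, to = 1, length.out = 5)  # or request a fixed number of values
## [1] 0.00 0.25 0.50 0.75 1.00
```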

Now lets extract the 3rd number in the vector “whatever”.

whatever[3]    # Can you guess what it will be?
## [1] 3

And the 2nd to 5th element of “whatever”:

whatever[2:5]
## [1] 2 3 4 5

In case you were wondering, the numbers in [square brackets] on the left of the output indicate the index of the first element shown on each line.

 

With dataframes it works slightly differently. When you work with dataframes, you need to input 2 numbers, i.e. rows and columns (in this order!), separated by a comma.

Indexing of a Matlab matrix works similarly, with the difference that in Matlab you could also do linear indexing, which means going down the columns one by one.

mtcars[,3]     # all the elements (rows) of the 3rd column in the dataframe mtcars
##  [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6
## [12] 275.8 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0
## [23] 304.0 350.0 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0
mtcars[3,]     # all the elements (columns) of the 3rd row in the dataframe mtcars
##             mpg cyl disp hp drat   wt  qsec vs am gear carb
## Datsun 710 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1
mtcars[1:5,c(3,4)]     # The first 5 rows of columns 3 and 4 in the dataframe mtcars
##                   disp  hp
## Mazda RX4          160 110
## Mazda RX4 Wag      160 110
## Datsun 710         108  93
## Hornet 4 Drive     258 110
## Hornet Sportabout  360 175
# this gives the last 17 rows and the last 6 columns of the dataframe mtcars, based on the total number of rows and columns
mtcars[ (nrow(mtcars)/2):nrow(mtcars),  ceiling( (ncol(mtcars)/2) ):ncol(mtcars)]
##                        wt  qsec vs am gear carb
## Lincoln Continental 5.424 17.82  0  0    3    4
## Chrysler Imperial   5.345 17.42  0  0    3    4
## Fiat 128            2.200 19.47  1  1    4    1
## Honda Civic         1.615 18.52  1  1    4    2
## Toyota Corolla      1.835 19.90  1  1    4    1
## Toyota Corona       2.465 20.01  1  0    3    1
## Dodge Challenger    3.520 16.87  0  0    3    2
## AMC Javelin         3.435 17.30  0  0    3    2
## Camaro Z28          3.840 15.41  0  0    3    4
## Pontiac Firebird    3.845 17.05  0  0    3    2
## Fiat X1-9           1.935 18.90  1  1    4    1
## Porsche 914-2       2.140 16.70  0  1    5    2
## Lotus Europa        1.513 16.90  1  1    5    2
## Ford Pantera L      3.170 14.50  0  1    5    4
## Ferrari Dino        2.770 15.50  0  1    5    6
## Maserati Bora       3.570 14.60  0  1    5    8
## Volvo 142E          2.780 18.60  1  1    4    2

But a really cool thing in R is that it is very easy to index a column in a dataframe based on its name (since 2013, this is also possible when using the “Table” format in Matlab, but not as easy as it is in R).

# for example, I could first check again the names of the columns in mtcars
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# or use colnames()  which is basically the same, use rownames() for the row names
colnames(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# then I can index with the "$" sign the column called "hp", which describes the power of a motor in horse power
mtcars$hp  # this gives you all the elements (rows) in the column hp
##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230
## [18]  66  52  65  97 150 150 245 175  66  91 113 264 175 335 109
# notice that if you extract a column from a dataframe and save it in a new object, it will no longer be a dataframe!
mtcars_hp <- mtcars$hp
class(mtcars_hp)  # we now have a vector of numeric elements
## [1] "numeric"
typeof(mtcars_hp)  # for R numbers are of the type "double"
## [1] "double"
# Oddly, if you extract a row from a dataframe it is still a dataframe (some things are just inconsistent I guess) 
mtcars_row3 <- mtcars[3,]
class(mtcars_row3)
## [1] "data.frame"
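As a side note (a small sketch of base R’s extraction rules): single brackets with a column name keep the data.frame class, while double brackets or $ drop down to the plain vector.

```r
hp_df  <- mtcars["hp"]    # single brackets: still a data.frame with 1 column
hp_vec <- mtcars[["hp"]]  # double brackets: the underlying numeric vector
class(hp_df)              # "data.frame"
class(hp_vec)             # "numeric"
identical(hp_vec, mtcars$hp)  # TRUE, [[ ]] and $ give the same result
```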

 

Coerce to data.frame

We could however also force mtcars_hp to be a dataframe, like mtcars, from which it originally came. This is done with as.data.frame()

mtcars_hp <- as.data.frame(mtcars$hp)
class(mtcars_hp)
## [1] "data.frame"

 

Finding unique values

What are all the unique numbers of cylinders that cars in the mtcars dataset have?

unique(mtcars$cyl)
## [1] 6 4 8

Or do it with the table() function, which also tells you how many car models exist for each number of cylinders

table(mtcars$cyl)
## 
##  4  6  8 
## 11  7 14

In your own datasets the table() function might come in handy to quickly check how many trials you have per participant and per condition!
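As a small sketch with made-up data (the column names subj and cond are hypothetical), a two-way table() gives the trial count for every participant-by-condition cell:

```r
# hypothetical mini-dataset: 2 participants, 2 conditions, 3 trials per cell
mytrials <- data.frame(subj = rep(c("s1", "s2"), each = 6),
                       cond = rep(c("happy", "sad"), times = 6))
table(mytrials$subj, mytrials$cond)  # every cell should contain 3
```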

 

Sorting

The function order() gives the ranked position of each element when it is applied to a variable

# for example, we can find the ascending rank of the elements of a
a
## [1]   0.0    NA   3.2 -15.0   2.0
order(a)
## [1] 4 1 5 3 2

The smallest element of a is -15, and it is in 4th position. The next smallest element is 0, which is the 1st element of a, and so forth.
Knowing this, we can use the output of order(a) to reshuffle a into ascending order:

a[order(a)]
## [1] -15.0   0.0   2.0   3.2    NA

It works similarly with data frames:

# sort data frame mtcars by 1 variable
new_order <- order(mtcars$mpg)   # get the indexes corresponding to the smallest to largest mpg
mtcars_sort  <- mtcars[new_order,]  # now put the rows of mtcars into the new order
head(mtcars_sort)
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
# sort matrix mtcars by 2 variables: miles per gallon and displacement
new_order <- order(mtcars$mpg, mtcars$disp)   # use mpg and disp
mtcars_sort  <- mtcars[new_order,]  
head(mtcars_sort)
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
# sort by mpg ascending, and by cyl descending (by putting a minus sign)
new_order <- order(mtcars$mpg, -mtcars$cyl)
mtcars_sort <- mtcars[new_order,]
head(mtcars_sort)
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
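If you only need the sorted values themselves (not a reordered data frame), sort() is a shortcut; both sort() and order() also take a decreasing argument:

```r
sort(mtcars$mpg)[1:3]                      # the 3 smallest mpg values
sort(mtcars$mpg, decreasing = TRUE)[1:3]   # the 3 largest mpg values
# the same argument reverses the row order of the whole data frame
head(mtcars[order(mtcars$mpg, decreasing = TRUE), ], 3)
```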

 

Replicate

Create a vector filled with 10 times the same number:

rep(7, 10)   # the extra c() wrapper is unnecessary
##  [1] 7 7 7 7 7 7 7 7 7 7

Create a vector filled with 5 times the same string:

rep("hello", 5)   # the c() wrapper is unnecessary
## [1] "hello" "hello" "hello" "hello" "hello"
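rep() also distinguishes the times and each arguments, which is handy for building condition columns (the labels below are made up):

```r
rep(c("A", "B"), times = 3)  # "A" "B" "A" "B" "A" "B"  (repeat the whole vector)
rep(c("A", "B"), each = 3)   # "A" "A" "A" "B" "B" "B"  (repeat each element)
```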

 

Dealing with missing values

(For a really good video tutorial (including hands-on exercises) on missing and special values see here )

Missing values are represented by NA (not available),
but sometimes also by a dot (“.”) or an empty space.

The function is.na() helps you to find missing data in your dataset

# let's create a dataframe with 4 rows, 2 columns, and 3 missing values
mydata <- data.frame(A = c(2, 5, NA, 7), B = c(NA, 1, NA, 9))
mydata
##    A  B
## 1  2 NA
## 2  5  1
## 3 NA NA
## 4  7  9
# then check for missing data
is.na(mydata)
##          A     B
## [1,] FALSE  TRUE
## [2,] FALSE FALSE
## [3,]  TRUE  TRUE
## [4,] FALSE FALSE
# To simply know if there are ANY missing data use any():
any(is.na(mydata))
## [1] TRUE
# For R, TRUE == 1, and FALSE == 0
# Therefore, we can count the number of missing data with
sum(is.na(mydata))
## [1] 3

Another way to see the number of NAs is summary()

summary(mydata)
##        A               B    
##  Min.   :2.000   Min.   :1  
##  1st Qu.:3.500   1st Qu.:3  
##  Median :5.000   Median :5  
##  Mean   :4.667   Mean   :5  
##  3rd Qu.:6.000   3rd Qu.:7  
##  Max.   :7.000   Max.   :9  
##  NA's   :1       NA's   :2

complete.cases() tells you which rows contain NAs (FALSE), and which ones do not (TRUE)

complete.cases(mydata)
## [1] FALSE  TRUE FALSE  TRUE

Use na.omit() to remove cases (i.e. rows) with NAs

# the result is a dataframe with only 2 rows and 2 columns
na.omit(mydata)
##   A B
## 2 5 1
## 4 7 9

Many functions only work if you indicate to remove missing values, which you do with na.rm = TRUE:

mean(mydata$A) # returns NA
## [1] NA
mean(mydata$A, na.rm=TRUE) # this works
## [1] 4.666667
colMeans(mydata)  # returns NA NA
##  A  B 
## NA NA
colMeans(mydata, na.rm = TRUE) # this works
##        A        B 
## 4.666667 5.000000
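Instead of dropping incomplete rows, you can also replace NAs in place via logical indexing. Here is a minimal mean-imputation sketch (the toy data frame from above is recreated so the chunk is self-contained):

```r
mydata <- data.frame(A = c(2, 5, NA, 7), B = c(NA, 1, NA, 9))
# overwrite the NAs in column A with the mean of the non-missing values
mydata$A[is.na(mydata$A)] <- mean(mydata$A, na.rm = TRUE)
mydata$A  # the former NA is now 4.666667
```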

 

Transforming data to deal with non-normality

Many common statistical tests assume that the data is normally distributed, and that the variances are similar across groups (or conditions). Violation of these assumptions can lead to wrong results.

Luckily, there are several ways to “beat” (transform) data into a more normal shape. Many types of data transformations can be used. For an overview, and a useful tool, see the package bestNormalize(). Alternatively, use nonparametric statistics.

For example, RT data is typically right-skewed, but can become more normal if you apply a log() or log10() transformation.

# create a right-skewed distribution using rexgauss() from the package retimes
library(retimes)
rt_dist1 <- rexgauss(5000, 300, 100, 200, positive = T)
hist(rt_dist1, xlab = "RT in milliseconds",
    axes=F,   main='Positive Skewed')

# test for normality -> it is not normal
shapiro.test(rt_dist1)
## 
##  Shapiro-Wilk normality test
## 
## data:  rt_dist1
## W = 0.91009, p-value < 2.2e-16
# histogram of log-transformed data -> looks much more normal!
hist(log(rt_dist1), xlab = "RT in milliseconds",
    axes=F,   main='after log() transformation')

# test for normality -> still not normal
shapiro.test(log(rt_dist1))
## 
##  Shapiro-Wilk normality test
## 
## data:  log(rt_dist1)
## W = 0.9771, p-value < 2.2e-16

Thus, although the Shapiro-Wilk test remained significant, the data looked much more normally distributed after the logarithmic transformation.

Based on visual inspection of the histogram, and keeping in mind that for example ANOVA is quite robust against violating the normality assumption, one could go on and do statistics on this data (but not everybody might agree).

Also keep in mind that the shapiro.test() function does not work with vectors with more than 5000 elements :(
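One common workaround (a sketch, not the only option) is to run shapiro.test() on a random subsample of at most 5000 values:

```r
set.seed(1)              # for reproducibility
big <- rlnorm(10000)     # 10000 right-skewed values; too many for shapiro.test()
sub <- sample(big, 5000) # a random subsample of the maximum allowed size
shapiro.test(sub)        # this runs fine
```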

 

Subsetting, splitting and merging data frames

 

Subsetting

We have already seen how to select only parts of a vector or data frame
(e.g. mtcars[1:3, 1:3] selects the first 3 rows and columns)

It is also easy to exclude only a few columns and keep all the rest:

# say you want to remove the 2nd column (cyl) and the 4th column (hp)
mtcars_pruned <- mtcars[c(-2, -4)]  # put a - in front of the columns you want to remove
head(mtcars_pruned)
##                    mpg disp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0  160 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0  160 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8  108 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4  258 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7  360 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1  225 2.76 3.460 20.22  1  0    3    1
# or we could do so by indicating the names of the columns;
# let's first find them
cols_to_remove <- which(colnames(mtcars) == "disp" |  colnames(mtcars) == "hp")  # the | sign means 'or'
# then remove them
mtcars_pruned <- mtcars[-cols_to_remove]
head(mtcars_pruned)
##                    mpg cyl drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 2.76 3.460 20.22  1  0    3    1

Another useful function is subset()

# to select the cars with 4 cylinders and at least 70 horse power:
subset(mtcars, cyl == 4  &  hp > 70)
##                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Datsun 710    22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 230      22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa  30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E    21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
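subset() also has a select argument, so rows and columns can be picked in the same call:

```r
# same row filter as above, but keep only the columns mpg and hp
subset(mtcars, cyl == 4 & hp > 70, select = c(mpg, hp))
# a minus sign drops columns instead of keeping them
subset(mtcars, cyl == 4 & hp > 70, select = -c(disp, hp))
```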

 

Combining data frames

Use cbind() to put data frames next to each other (add columns), and rbind() below each other (add rows)

mtcars_wide <- cbind(mtcars_pruned, mtcars)
colnames(mtcars_wide)
##  [1] "mpg"  "cyl"  "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb" "mpg" 
## [11] "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

Or combine them in the other direction

mtcars_long <- rbind(mtcars[1:2, ], mtcars[31:32,])  # stack the first 2 and the last 2 rows
dim(mtcars_long)
## [1]  4 11

But this does not work if the 2 data frames differ in the other dimension (number of variables)

# e.g. you cannot put a data frame with 9 columns on top of one with 11 columns
mtcars_doesntwork <- rbind(mtcars_pruned, mtcars)  # you get an error
## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match
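If you do not need the extra columns, one possible fix (a sketch) is to first cut the wider data frame down to the columns the two frames share:

```r
mtcars_pruned <- mtcars[c(-2, -4)]  # the 9-column version from above
common <- intersect(colnames(mtcars_pruned), colnames(mtcars))
stacked <- rbind(mtcars_pruned, mtcars[, common])  # now rbind() works
dim(stacked)  # 64 rows, 9 columns
```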

 

Combining datasets with different sizes

Use cbind.fill() from the package rowr to horizontally combine data frames with a different number of rows.
E.g. this does not work: cbind(mtcars, iris)

library(rowr)
# ATTENTION: if you do not specify "fill=NA" the rows of the smaller data frame will be repeated!
cars_and_flowers_wide <- cbind.fill(mtcars, iris, fill = NA)  
cars_and_flowers_wide[32:33,1:12]
##     mpg cyl disp  hp drat   wt qsec vs am gear carb Sepal.Length
## 32 21.4   4  121 109 4.11 2.78 18.6  1  1    4    2          5.4
## 33   NA  NA   NA  NA   NA   NA   NA NA NA   NA   NA          5.2

Use rbind.fill() from the package plyr to vertically combine data frames with a different number of columns.
Again, this does not work: rbind(mtcars, iris)

library(plyr)
cars_and_flowers_long <- rbind.fill(mtcars, iris)
cars_and_flowers_long[32:33,1:12]
##     mpg cyl disp  hp drat   wt qsec vs am gear carb Sepal.Length
## 32 21.4   4  121 109 4.11 2.78 18.6  1  1    4    2           NA
## 33   NA  NA   NA  NA   NA   NA   NA NA NA   NA   NA          5.1

Now let’s check which cases are complete (have no missing values)

dim(cars_and_flowers_wide)
## [1] 150  16
complete.cases(cars_and_flowers_wide)  # (in this case only the first 32 cases are complete)
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Select subset of non-missing cases in a data frame with na.omit()

dim(cars_and_flowers_wide)
## [1] 150  16
cars_and_flowers_wide_noNA = na.omit(cars_and_flowers_wide)
dim(cars_and_flowers_wide_noNA)
## [1] 32 16

 

Merging data frames

Useful for example to combine RTs with questionnaire data (often in separate files)

# load dataset "iris"
data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# split in 2 different datasets (but both contain the 5th column "Species")
iris_1 <- iris[, c(1,2,5)]
iris_2 <- iris[, c(3,4,5)]
# then merge them again (order matters)
iris_merged <- merge(iris_2, iris_1, by = "Species")
head(iris_merged)
##   Species Petal.Length Petal.Width Sepal.Length Sepal.Width
## 1  setosa          1.4         0.2          5.1         3.5
## 2  setosa          1.4         0.2          4.9         3.0
## 3  setosa          1.4         0.2          4.7         3.2
## 4  setosa          1.4         0.2          4.6         3.1
## 5  setosa          1.4         0.2          5.0         3.6
## 6  setosa          1.4         0.2          5.4         3.9

 

Merge by two variables

iris_3 <- iris[, c(1,4,5)]
iris_merged <- merge( iris_3, iris_2, by = c("Species", "Petal.Width") )
head(iris_merged)
##   Species Petal.Width Sepal.Length Petal.Length
## 1  setosa         0.1          4.9          1.4
## 2  setosa         0.1          4.9          1.1
## 3  setosa         0.1          4.9          1.5
## 4  setosa         0.1          4.9          1.4
## 5  setosa         0.1          4.9          1.5
## 6  setosa         0.1          4.3          1.4
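By default merge() keeps only rows whose key appears in both data frames (an inner join). Setting all = TRUE (or all.x / all.y) also keeps unmatched rows, filled with NA. A small sketch with made-up data (id, rt, and score are hypothetical names):

```r
left  <- data.frame(id = c(1, 2, 3), rt = c(350, 420, 390))
right <- data.frame(id = c(2, 3, 4), score = c(10, 12, 9))
merge(left, right, by = "id")              # only ids 2 and 3 survive
merge(left, right, by = "id", all = TRUE)  # ids 1 to 4, with NAs filled in
```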

 

Change data frame from LONG to WIDE format

In the wide format, a subject’s repeated responses will be in a single row, and each response is in a separate column.

In the long format, each row is 1 time point x subject. So each subject will have data in multiple rows. Any variables that do not change across time (e.g. subjects’ gender) will have the same value in all the rows.

To change from LONG to WIDE use dcast() from the “reshape2” package

data("ChickWeight")  # import the chicken weight data
# somehow the order of the levels of the factor $Chick is messed up, so let's change it
ChickWeight$Chick <- as.factor(as.numeric(ChickWeight$Chick))
some(ChickWeight)  # some() from the "car" package shows a few random rows
## Grouped Data: weight ~ Time | Chick
##     weight Time Chick Diet
## 10     171   18    15    1
## 12     205   21    15    1
## 45     136   16    11    1
## 61      41    0    12    1
## 203     82   12    10    1
## 257     42    0    21    2
## 291    236   20    27    2
## 321     87    8    29    2
## 403     61    4    33    3
## 556     53    2    46    4
length(unique(ChickWeight$Time))
## [1] 12

The weight of most chickens was measured at 12 different time points.
But some chickens were measured less often.
You can verify this with the table() function

# see the first 6 rows (chicken). A 0 indicates that no measurement occurred
head(table(ChickWeight$Chick, ChickWeight$Time))  
##    
##     0 2 4 6 8 10 12 14 16 18 20 21
##   1 1 1 0 0 0  0  0  0  0  0  0  0
##   2 1 1 1 1 1  1  1  0  0  0  0  0
##   3 1 1 1 1 1  1  1  1  0  0  0  0
##   4 1 1 1 1 1  1  1  1  1  1  1  1
##   5 1 1 1 1 1  1  1  1  1  1  1  1
##   6 1 1 1 1 1  1  1  1  1  1  1  1
library(reshape2)
# now transform to wide format using dcast():
Chicken_wide <- dcast(ChickWeight, 
        # after the name of the dataset put the variables to keep as they are
    Chick + Diet 
        # then ~ name of variable to split into separate columns
    ~ Time, 
        # then name of the dependent variable
    value.var = "weight")
head(Chicken_wide)
##   Chick Diet  0  2  4  6  8 10 12 14 16  18  20  21
## 1     1    1 39 35 NA NA NA NA NA NA NA  NA  NA  NA
## 2     2    1 41 45 49 51 57 51 54 NA NA  NA  NA  NA
## 3     3    1 41 49 56 64 68 68 67 68 NA  NA  NA  NA
## 4     4    1 41 48 53 60 65 67 71 70 71  81  91  96
## 5     5    1 42 51 59 68 85 96 90 92 93 100 100  98
## 6     6    1 41 47 54 58 65 73 77 89 98 107 115 117
# improve the header
colnames(Chicken_wide)  # let's check the column names first
##  [1] "Chick" "Diet"  "0"     "2"     "4"     "6"     "8"     "10"   
##  [9] "12"    "14"    "16"    "18"    "20"    "21"
ts <- rep("t_", 12)  # create a vector with 12 times "t_"
newnames <- paste(ts, colnames(Chicken_wide)[3:14], sep="")   # concatenate "t_" with colnames
colnames(Chicken_wide)[3:14] <- newnames
head(Chicken_wide)
##   Chick Diet t_0 t_2 t_4 t_6 t_8 t_10 t_12 t_14 t_16 t_18 t_20 t_21
## 1     1    1  39  35  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA
## 2     2    1  41  45  49  51  57   51   54   NA   NA   NA   NA   NA
## 3     3    1  41  49  56  64  68   68   67   68   NA   NA   NA   NA
## 4     4    1  41  48  53  60  65   67   71   70   71   81   91   96
## 5     5    1  42  51  59  68  85   96   90   92   93  100  100   98
## 6     6    1  41  47  54  58  65   73   77   89   98  107  115  117

 

Change data frame from WIDE to LONG format

Use the melt() function from the “reshape2” package

head(Chicken_wide)
##   Chick Diet t_0 t_2 t_4 t_6 t_8 t_10 t_12 t_14 t_16 t_18 t_20 t_21
## 1     1    1  39  35  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA
## 2     2    1  41  45  49  51  57   51   54   NA   NA   NA   NA   NA
## 3     3    1  41  49  56  64  68   68   67   68   NA   NA   NA   NA
## 4     4    1  41  48  53  60  65   67   71   70   71   81   91   96
## 5     5    1  42  51  59  68  85   96   90   92   93  100  100   98
## 6     6    1  41  47  54  58  65   73   77   89   98  107  115  117
data_long <- melt(Chicken_wide,
        # id.variables = all the variables to keep w/o splitting
    id.vars=c("Chick", "Diet"),
        # measure.vars = the columns to put from wide into long format
    measure.vars=c(colnames(Chicken_wide)[3:14]),
        # Name of the destination column
    variable.name="Time",
          # name of the column that will hold the measured values
    value.name="weight")
head(data_long)
##   Chick Diet Time weight
## 1     1    1  t_0     39
## 2     2    1  t_0     41
## 3     3    1  t_0     41
## 4     4    1  t_0     41
## 5     5    1  t_0     42
## 6     6    1  t_0     41

If you leave out measure.vars melt() will automatically use all the remaining variables as id.vars.
The reverse is true if you leave out id.vars.
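A quick sketch of that shortcut on a made-up two-column data frame (assuming the reshape2 package is installed): only id.vars is given, and melt() picks up the remaining columns as measure variables.

```r
library(reshape2)
toy_wide <- data.frame(subj = c("s1", "s2"), t_0 = c(10, 12), t_2 = c(14, 15))
# t_0 and t_2 are used as measure.vars automatically
melt(toy_wide, id.vars = "subj")
```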

 

Arithmetic operators

## 
## -----------------------------
##  Operator     Description    
## ---------- ------------------
##     +           addition     
## 
##     -         subtraction    
## 
##     *        multiplication  
## 
##     /           division     
## 
##  ^ or **     exponentiation  
## 
##     %%          modulus      
## 
##    %/%      integer division 
## -----------------------------

Examples with arithmetic operators

# divide column hp by column cyl
mtcars$hp / mtcars$cyl
##  [1] 18.33333 18.33333 23.25000 18.33333 21.87500 17.50000 30.62500
##  [8] 15.50000 23.75000 20.50000 20.50000 22.50000 22.50000 22.50000
## [15] 25.62500 26.87500 28.75000 16.50000 13.00000 16.25000 24.25000
## [22] 18.75000 18.75000 30.62500 21.87500 16.50000 22.75000 28.25000
## [29] 33.00000 29.16667 41.87500 27.25000
# use the modulus to check which numbers in column gear are odd or even
mtcars$gear %% 2  # 0 is even, 1 is odd
##  [1] 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1 1 1 0
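The modulus and integer division are complementary: quotient times divisor plus remainder recovers the original number.

```r
17 %/% 5                    # 3, how many whole times 5 fits into 17
17 %% 5                     # 2, what is left over
(17 %/% 5) * 5 + (17 %% 5)  # 17, back to the original number
```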

 

Logical operators

## 
## --------------------------------------
##  Operator          Description        
## ---------- ---------------------------
##     <               less than         
## 
##     <=        less than or equal to   
## 
##     >             greater than        
## 
##     >=      greater than or equal to  
## 
##     ==          exactly equal to      
## 
##     !=            not equal to        
## 
##     !            not (something)      
## 
##     |                  or             
## 
##     &                  and            
## 
##  isTRUE()   test if something is true 
## --------------------------------------

Examples with logical operators

# find which cars have 2 carburetors
mtcars$carb == 2
##  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
## [23]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE
# find which cars DO NOT HAVE 2 carburetors
mtcars$carb != 2
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## [23] FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE
# or
! mtcars$carb == 2
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## [23] FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE
# find cars that have at least 100 hp and less than 8 cylinders
(mtcars$hp >= 100  &  mtcars$cyl < 8)
##  [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE
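These logical vectors turn into row numbers (and from there into row names) with which():

```r
idx <- which(mtcars$carb == 2)  # row numbers of the cars with 2 carburetors
rownames(mtcars)[idx]           # their names
sum(mtcars$carb == 2)           # 10 such cars (TRUE counts as 1 when summing)
```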

 

Compute means by subject and/or condition

Use function aggregate() in package stats

# load data "ChickWeight", which contains Weight versus age of chicks on different diets
data(ChickWeight)   
head(ChickWeight)
## Grouped Data: weight ~ Time | Chick
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
Weight_by_Diet <- aggregate(ChickWeight$weight, by = list(Diet = ChickWeight$Diet),  
                            FUN = mean, na.rm = TRUE)
colnames(Weight_by_Diet)[2] <- "mean weight" # let's add a nicer header for the 2nd column
Weight_by_Diet
##   Diet mean weight
## 1    1    102.6455
## 2    2    122.6167
## 3    3    142.9500
## 4    4    135.2627

You could also aggregate by time instead

Weight_by_Time <- aggregate(ChickWeight$weight, by = list(Time = ChickWeight$Time), FUN = mean, na.rm = TRUE)
colnames(Weight_by_Time)[2] <- "mean weight" 
Weight_by_Time
##    Time mean weight
## 1     0    41.06000
## 2     2    49.22000
## 3     4    59.95918
## 4     6    74.30612
## 5     8    91.24490
## 6    10   107.83673
## 7    12   129.24490
## 8    14   143.81250
## 9    16   168.08511
## 10   18   190.19149
## 11   20   209.71739
## 12   21   218.68889

Or aggregate using a different function (e.g. SD)

Weight_by_Time <- aggregate(ChickWeight$weight, by = list(Time = ChickWeight$Time), FUN = sd, na.rm = TRUE)
colnames(Weight_by_Time)[2] <- "SD time" 
Weight_by_Time
##    Time   SD time
## 1     0  1.132272
## 2     2  3.688316
## 3     4  4.495179
## 4     6  9.012038
## 5     8 16.239780
## 6    10 23.987277
## 7    12 34.119600
## 8    14 38.300412
## 9    16 46.904079
## 10   18 57.394757
## 11   20 66.511708
## 12   21 71.510273

Or aggregate by 2 variables (Time & Diet)

Weight_by_Time_and_Diet <- aggregate(ChickWeight$weight, by = list(Time = ChickWeight$Time, Diet = ChickWeight$Diet), FUN = mean, na.rm = TRUE)
colnames(Weight_by_Time_and_Diet)[3] <- "Mean Weight"
some(Weight_by_Time_and_Diet)
##    Time Diet Mean Weight
## 17    8    2     91.7000
## 19   12    2    131.3000
## 25    0    3     40.8000
## 29    8    3     98.4000
## 30   10    3    117.1000
## 37    0    4     41.0000
## 38    2    4     51.8000
## 41    8    4    105.6000
## 44   14    4    161.8000
## 47   20    4    233.8889

Or use ddply() in the package plyr, which also allows you to simultaneously compute N, M, and SD

library(plyr)
Weight_by_Time_and_Diet <- ddply(ChickWeight, .(Diet, Time), summarise, 
              N = length(weight),
              M = round(mean(weight, na.rm=TRUE),2),
              SD = round(sd(weight, na.rm=TRUE),2)  )
head(Weight_by_Time_and_Diet)
##   Diet Time  N     M    SD
## 1    1    0 20 41.40  0.99
## 2    1    2 20 47.25  4.28
## 3    1    4 19 56.47  4.13
## 4    1    6 19 66.79  7.76
## 5    1    8 19 79.68 13.78
## 6    1   10 19 93.05 22.54
tail(Weight_by_Time_and_Diet)
##    Diet Time  N      M    SD
## 43    4   12 10 151.40 15.28
## 44    4   14 10 161.80 15.73
## 45    4   16 10 182.00 25.30
## 46    4   18 10 202.90 33.56
## 47    4   20  9 233.89 37.57
## 48    4   21  9 238.56 43.35
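aggregate() also accepts R’s formula interface, which is often shorter (it computes the same means as above):

```r
data(ChickWeight)
aggregate(weight ~ Diet, data = ChickWeight, FUN = mean)
# and by two variables:
head(aggregate(weight ~ Time + Diet, data = ChickWeight, FUN = mean))
```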

Plotting (with base graphics)

Plotting (or data visualisation) is essential to
1) better understand your data
2) summarize and convey the results of your analyses

There are several sets of plotting functions in R.
We will start with the “Base graphics” system, which offers quick but limited plotting possibilities.

The “ggplot2” system, which we’ll discuss further on, is a bit more complicated to use but provides much more options and possibilities.

When you create a plot in RStudio, it is automatically shown in the “Plots” pane on the lower right.

From there, you can export it to an image (e.g. jpg or png) or a .pdf using the GUI (later, we will see other ways to do this)
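The export can also be scripted by opening a file device, drawing, and closing the device (a sketch; the temporary file path is just an example):

```r
f <- file.path(tempdir(), "my_scatter.pdf")  # example output path
pdf(f, width = 6, height = 4)                # open the pdf device
plot(mtcars$hp, mtcars$mpg)                  # draw into the file
dev.off()                                    # close the device to write the file
file.exists(f)                               # TRUE
```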

See here for a hands-on tutorial on plotting and data visualization in R

Scatter plot (2 variables)

Scatterplots can easily be made using the plot() function. This is a generic function, meaning that it can adapt to different data types. With it, one can create scatterplots, dot plots, line plots, etc.

Let’s do a simple scatterplot with the 2 variables “hp” (horsepower) and “mpg” (miles per gallon) from the dataset mtcars

plot(mtcars$hp, mtcars$mpg)

As expected, there seems to be a negative correlation between horsepower and the number of miles you can do with one gallon of gas, i.e. the stronger the motor, the higher the gas consumption and thus the lower the miles per gallon.

Now let’s modify the axis labels and add a title

plot(mtcars$hp, mtcars$mpg, xlab="Horse Power", ylab="Miles x Gallon", 
     main="Horse Power by Miles x Gallon in mtcars")

Notice that you could produce the same plot using R’s formula interface:
plot(mpg ~ hp, data = mtcars); in this case the y-axis variable is given first.

To see more options of the plot() function type ?graphics::plot and Enter in the console

The par call

Sets graphics parameters (fonts, colors, axes, titles), and keeps them until the next par call.

Useful for example to put multiple plots in the same figure.

# create a plot array with 1 row and 2 columns
par(mfrow = c(1,2))
plot(mtcars$hp, mtcars$mpg, xlab="Horse Power", ylab="Miles x Gallon", main = "plot 1")
plot(mtcars$cyl, mtcars$carb, xlab="cylinders", ylab="carburetors", main = "plot 2")

But there are many more parameters. Check the current ones with par()
See this link for a description.

CAREFUL: you cannot use par() to combine plots made with ggplot2 (see below).

Scatter plot (more variables)

In order to explore the data, one could also use scatterplots comparing more than 2 variables.

Let’s create a plot array with the first 4 variables in the mtcars dataset:

plot(mtcars[,1:4])

Line plot

Could be done with the plot() function, putting the “type” argument to “l”

# let's order mtcars by number of cylinders
mtcars_sorted <- mtcars[order(mtcars$cyl), ]
plot(mtcars_sorted$hp,  type = "l", xlab="cars ordered by cyl", ylab="horsepower", main="this graph doesn't make much sense")

# possible types are "p" for points, "l" for lines, "b" for both, etc. -> see ?plot()

Or do first a scatterplot with plot() and then add the lines with lines()

plot(mtcars_sorted$hp)
lines(mtcars_sorted$hp, type = "l")

Sunflower plot

Sunflower plots are useful when the same combination of values occurs many times.

For each repetition, a petal is added:
* a black dot without petals = 1 single data point
* 2 red petals = 2 identical data points
* 3 red petals = 3 identical data points, etc.

sunflowerplot(mtcars$cyl, mtcars$carb, xlab ="number of cylinders", ylab ="number of carburetors", main ="Sunflowerplot of cyl by carb")

Box plot

Useful to see how the distribution of a numerical variable changes over the levels of a categorical variable, or of another numerical variable that takes only a few values.

The box of the boxplot includes: 1st quartile, median, and 3rd quartile.

By default when using boxplot() the whiskers extend to the most extreme data point which is no more than 1.5 times the interquartile range from the box.
Everything beyond is an outlier (marked with points).

The whiskers can however be changed, e.g. range = 0 causes the whiskers to extend to the data extremes.

Let’s see gas consumption by number of cylinders

boxplot(mtcars$mpg ~ mtcars$cyl, xlab="cylinders", ylab="miles per gallon")

# now extend the whiskers to the data extremes (no more outliers)
boxplot(mtcars$mpg ~ mtcars$cyl, range = 0, xlab="cylinders", ylab="miles per gallon")

Mosaic plot

Like a scatterplot between 2 categorical variables, or between numerical variables that take only a few values.

A thin line is shown where a combination of categories is absent.

mosaicplot(cyl ~ carb, data= mtcars)

Pie charts

These look cool but are not recommended.
Even the help (?pie) for this function thinks so:
“Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.”

# first let's get a count of the number of cars per number of carburetors 
tbl <- table(mtcars$carb)
# then let's make a pie chart out of it
pie(tbl)
title("amount of cars with 1,2,3,4,6, or 8 carburetors")

Barplot

Now we show the same data as a bar plot instead. Remember that ideally you’d also add error bars; this is just a simple example!

tbl <- table(mtcars$carb)
# then let's make a bar chart out of it
barplot(tbl, xlab = "number of carburetors", ylab = "number of cars")
title("number of cars by number of carburetors")

This is actually a histogram!

Histogram

Shows the frequencies of occurrences for each level of a given variable.
Useful also to quickly see the distribution of the data (is it normal?).
Let’s again plot the number of cars that have a given number of carburetors, but this time with the function hist()

# ATTENTION: if you do not want the counts of cars with 1 or 2 carburetors to be lumped together, you need to change the breaks!!
hist(mtcars$carb, breaks=seq(0.5,8.5,1), freq=TRUE)

QQ plot

A common assumption necessary for many statistical analyses is that the data is approximately normally distributed.
There are several ways of testing this. One of them is the qqPlot (quantile-quantile plot).

For the qqplot we sort our data, plot it against a specially-designed x-axis based on our reference distribution (e.g., the Gaussian “bell curve”), and look to see whether the points lie approximately on a straight line.

If the data is normally distributed, the points will align on the diagonal of the qqplot
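This recipe can be reproduced by hand in base R (a sketch: qnorm() and ppoints() are standard functions; the simulated data are just for illustration):

```r
set.seed(1)
x <- rnorm(100)  # simulated data to check

theoretical <- qnorm(ppoints(length(x)))  # quantiles of the standard normal
observed    <- sort(x)                    # the sorted data

plot(theoretical, observed)  # for normal data, the points fall on a line
cor(theoretical, observed)   # close to 1 for normal data
```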

The variable “weight” in the “ChickWeight” dataset does not follow a normal distribution

# load the "car" package to use the function qqPlot()
library(car)
par(mfrow = c(1,2))
hist(ChickWeight$weight)
qqPlot(ChickWeight$weight)

## [1] 400 399

But if we select only measures at Time == 16,
then the assumption of normality seems more reasonable

index16 = which(ChickWeight$Time == 16)
weights = ChickWeight[index16, "weight"]
par(mfrow = c(1,2))
hist(weights, breaks=seq(1,320,20))
qqPlot(weights)

## [1] 32 18

And here an example of data that is definitely not normally distributed

library(MASS)
par(mfrow = c(1,2))
hist(Boston$tax)
qqPlot(Boston$tax)

## [1] 489 490

Plotting with ggplot2

ggplot2 was written by Hadley Wickham, a kiwi statistician, professor, and chief scientist at RStudio (check out his website)
 

It is “a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.” (http://ggplot2.org)
 

Plotting functions in base R can still be useful to quickly explore the data.
But if you want anything beyond those simple graphs, it’s best to switch to ggplot2.
 

“ggplot2 provides a unified interface and set of options, instead of the grab bag of modifiers and special cases required in base graphics. Once you learn how ggplot2 works, you can use that knowledge for everything from scatter plots and histograms to violin plots and maps.” (R graphics cookbook by Winston Chang, p.7)  

“In R’s base graphics functions, each mapping of data properties to visual properties is its own special case, and changing the mappings may require restructuring your data, issuing completely different graphics commands, or both.” (R graphics cookbook by Winston Chang, p.374)

 

CAREFUL: you cannot use par() to combine plots made with ggplot2 (see below). Use the cowplot or gridExtra packages instead.

DISCLAIMER: I have copied a lot of examples from the R graphics cookbook by Winston Chang!

Bar and line plots using base graphics

# Create a 2x3 matrix (with row and column names)
mydata <- matrix(c(10, 7, 12, 9,11,6), nrow = 2, ncol = 3, byrow = TRUE, 
                 dimnames = list(c("B1", "B2"),
                                 c("A1", "A2", "A3") ))
mydata
##    A1 A2 A3
## B1 10  7 12
## B2  9 11  6

 

Make a barplot of the data using the function barplot(), with the As along the x-axis and the bars grouped by the Bs

graphics::barplot(mydata, beside = TRUE ) # beside=TRUE draws separate bars for each data point, otherwise B1 and B2 overlap with diff colors

 

To turn the barplot around (Bs along the x-axis and bars grouped by As), you first need to restructure the data (e.g. using the function t(), for transpose)

t(mydata)
##    B1 B2
## A1 10  9
## A2  7 11
## A3 12  6

 

Then you can do another barplot:

barplot(t(mydata), beside = TRUE )

 

So far so good, transposing the data wasn’t that bad, one could say.
But things get more complicated if you want to make a lineplot.
To do this with base graphics you need to draw the lines for B1 and B2 in separate commands.

plot(mydata[1,], type="l")
lines(mydata[2,], type="l", col="blue")

And the result is not very convincing:

  • the lowest point on the blue line (B2) is not visible, because the range of the y-axis was set for the black line (B1)
  • the x-axis is numbered (what are 1.5 and 2.5 doing there?)

Bar and line plots with ggplot2

For ggplot2 the data must be in “long” format; once it is, you never need to restructure it for different plots.
So first we put mydata in long format using melt() from the reshape2 package:

library(reshape2)
mydata_long <- melt(mydata, value.name="value")
colnames(mydata_long)[1:2] <- c("theBs", "theAs")
mydata_long
##   theBs theAs value
## 1    B1    A1    10
## 2    B2    A1     9
## 3    B1    A2     7
## 4    B2    A2    11
## 5    B1    A3    12
## 6    B2    A3     6

 

Now comes the barplot using ggplot(), and it will look better right away (e.g. notice the axis labels and the legend):

ggplot(mydata_long, aes(x=theAs, y=value, fill=theBs)) +
  geom_bar(stat="identity", position="dodge")

 

Turning it around and putting the Bs on the x-axis is simple:

ggplot(mydata_long, aes(x=theBs, y=value, fill=theAs)) +
  geom_bar(stat="identity", position="dodge")

 

If you wanted a line plot instead:

ggplot(mydata_long, aes(x=theBs, y=value, colour=theAs, group=theAs)) +
  geom_line()

 

Or this way?

ggplot(mydata_long, aes(x=theAs, y=value, colour=theBs, group=theBs)) +
  geom_line()

ggplot2’s terminology

  • DATA: must be data frames in long format, with each variable in a different column
  • GEOMS: geometric objects, e.g. bars, lines, plots
  • AESTHETICS: visual properties of geoms, e.g. position of axes, line color, point shapes, etc.
  • SCALES: control how the data is converted into the elements of the plot (the mapping), e.g. greater values of a continuous y-axis appear higher up
  • GUIDES: e.g. tick marks and axis labels

Many of the elements of a ggplot are combined by joining them with the plus sign (+).
If you write the ggplot command over several lines, be careful to always put the + at the end of the line!
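A minimal sketch of the difference (this uses the mtcars data; the broken version is left commented out):

```r
library(ggplot2)

# CORRECT: the + at the end of the line tells R the command continues
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# WRONG: R treats the first line as a complete command,
# and then fails on the line starting with +
# p <- ggplot(mtcars, aes(x = wt, y = mpg))
#   + geom_point()
```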

Scatter plots with ggplot2

Let’s create some new data for a scatterplot

mydata <- data.frame(xval=1:4, yval=c(3,5,6,9), group=c("A", "B", "A", "B"))
mydata
##   xval yval group
## 1    1    3     A
## 2    2    5     B
## 3    3    6     A
## 4    4    9     B

 

Now we make a simple scatterplot with ggplot(),
but before plotting it we save it as “myplot”

myplot <- ggplot(mydata, aes(x=xval, y=yval))  +  geom_point()
myplot # then plot it by calling its name

 

You can change or add more elements to the saved plot using +
For example let’s plot groups A and B in different colors:

myplot + geom_point(aes(colour=group))

 

Or you can change the range of the x-axis:

myplot + scale_x_continuous(limits=c(-5, 10))

 

Change the colors:

myplot + geom_point(aes(colour=group)) + scale_colour_manual(values=c("orange", "forestgreen"))

Scatter plots with plot() vs. ggplot()

plot(mtcars$wt, mtcars$mpg)

library(ggplot2)
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point() +
  theme(panel.background = element_blank()) +   
  # to remove the gray background
  theme(axis.line = element_line(colour = "black"))  # otherwise the axes are no longer visible

Bar plot with errorbars

# let's create some data
mydata <- data.frame(group=c("A", "A", "A", "A", "A",  "A", "B", "B", "B", "B", "B", "B"),  Time= as.factor( c(1,2,1,2,1,2,1,2,1,2,1,2) ), yval=c(11,3,5,6,9,1,2,3,4,5,1,0) )
mydata
##    group Time yval
## 1      A    1   11
## 2      A    2    3
## 3      A    1    5
## 4      A    2    6
## 5      A    1    9
## 6      A    2    1
## 7      B    1    2
## 8      B    2    3
## 9      B    1    4
## 10     B    2    5
## 11     B    1    1
## 12     B    2    0

 

There are (at least) 2 ways to add errorbars representing standard error of the mean (SE).
Here we will discuss both and put the resulting plots next to each other (using the cowplot package).

The 1st way is to compute the SEs yourself (SD/sqrt(N)), and then plot them with geom_errorbar

library(plyr)  # for ddply()
mydata_summary <- ddply(mydata, c("group", "Time"), summarise, 
                        M = mean(yval, na.rm=TRUE),
                        SD = sd(yval, na.rm=TRUE),
                        N = sum(!is.na(yval)),
                        SE = SD/sqrt(N))
mydata_summary # look at the computed parameters
##   group Time        M       SD N        SE
## 1     A    1 8.333333 3.055050 3 1.7638342
## 2     A    2 3.333333 2.516611 3 1.4529663
## 3     B    1 2.333333 1.527525 3 0.8819171
## 4     B    2 2.666667 2.516611 3 1.4529663
library(ggplot2)  # load ggplot2, if necessary
errorbars1 <- ggplot(mydata_summary, aes(x=Time, y=M, fill=group))  + 
  geom_bar(stat="identity", position = position_dodge(.5), width=.5) +
  geom_errorbar(aes(ymin=M-SE, ymax=M+SE), position = position_dodge(.5), width=.2) +
  ggtitle("SEs by hand")
# position_dodge defines the spacing between bars (per default it is the same as the bar width, i.e. 0.9)
# position="dodge" is the short form of position=position_dodge()

 

The 2nd (easier) way is to have ggplot do it for you using fun.data = mean_se:

errorbars2 <- ggplot(mydata, aes(x=Time, y=yval, fill=group))  + 
  stat_summary(fun.y="mean", geom="bar", position=position_dodge(.5), width=.5) +  
  # the height of the bars represents the mean per Time and group
  stat_summary(fun.data = mean_se, geom = "errorbar", position=position_dodge(.5), width=.2) +
  # the errorbars show the SE
  ggtitle("Automatic SEs")

 

We use the package cowplot (see here) to arrange the 2 plots next to each other, and label them with A and B.
Using this package also makes the plots a bit nicer (white background, etc.)

#######    if necessary install and/or load cowplot:
# install.packages("cowplot")
library(cowplot)
plot_grid(errorbars1, errorbars2, labels = c("A", "B"))

 

Reassuringly, the 2 graphs look exactly the same.

Box plot with data points (grouped by 1 factor)

# ATTENTION: for the x-axis you need a factor, or a vector of characters
ggplot(mtcars, aes(as.factor(cyl), mpg)) +
                  geom_boxplot() +
                  geom_jitter(width = 0.2) +  # we can move each data point a bit to the right or left, to make it visible
                  labs(x = "N of cylinders")

Box plot with data points (grouped by 2 factors)

# plot
ggplot(mtcars, aes(x=as.factor(cyl), y=mpg, fill = as.factor(vs))) +
  geom_boxplot() + 
  geom_point(alpha=0.5, 
             position=position_jitterdodge(jitter.width=0.2), 
             aes(group=as.factor(vs))) 

Violin plot

The better box plot, as it also shows the distribution of the data (a kernel density plot).

# ATTENTION: for the x-axis you need a factor, or a vector of characters
ggplot(mtcars, aes(as.factor(cyl), mpg)) +
                  geom_violin() +
                  geom_jitter(width = 0.2) +  # we can move each data point a bit to the right or left, to make it visible
                  labs(x = "N of cylinders")

Violin and box plots combined

We can overlay box plots and violin plots made with ggplot2. For that, we need to specify their positions using the function position_dodge()

ggplot(mtcars, aes(as.factor(cyl), mpg)) +
                  geom_violin(position = position_dodge(width = .04)) +
                  geom_boxplot(width = .05, position = position_dodge(width = .04)) +
                  geom_jitter(width = 0.2) +  # we can move each data point a bit to the right or left, to make it visible
                  labs(x = "N of cylinders")

Pirate plots

The function pirateplot from the yarrr package allows you to easily combine box plots, violin plots, and the display of every single data point. There are many built-in color palettes you can choose from

library("yarrr")
pirateplot(formula = mpg ~ cyl, 
           data = mtcars,
           pal = "southpark",
           main = "mpg by cyl in mtcars",
           ylim = c(5,45))

You can have up to 3 different categorical IVs in a pirate plot

pirateplot(formula = mpg ~ cyl + gear, 
           data = mtcars,
           pal = "southpark",
           main = "mpg by cyl and by gears",
           ylim = c(5,45))

But, as far as I know, you cannot assign a pirateplot to an object, as you can do with ggplot2. This means that it is difficult to arrange several pirate plots in the same figure (see below).

Arranging several plots in a figure

If you plot with the base graphics package, you can put several plots in 1 figure using the par() function. But this does not work with graphs made with ggplot2, for which one can use the cowplot or gridExtra packages instead.

library(cowplot)

plot1 <- ggplot(mtcars, aes(cyl, mpg)) +
                  geom_bar(stat="summary", fun.y = "mean") +
                  labs(x = "N of cylinders") +
                  labs(y = "Average mpg")
plot2 <- ggplot(mtcars, aes(as.factor(cyl), mpg)) +
                  geom_violin(position = position_dodge(width = .04)) +
                  geom_boxplot(width = .05, position = position_dodge(width = .04)) +
                  geom_jitter(width = 0.2) +  # we can move each data point a bit to the right or left, to make it visible
                  labs(x = "N of cylinders")
# let's put plot1 and plot2 together, but in positions 1 and 4 of a 4-plot scenario
plot_grid(plot1, NULL, NULL, plot2, labels = c("A", "B", "C", "D"), ncol = 2)

Saving your plots and graphs

You can save your graphs by hand using the Export button on the right-hand Plots tab.
Or you define in your script how and where to save a graph (see link).
 

Here, I save to .pdf, by first defining the name (with the path!),
calling pdf() before the plot,
and dev.off() after the plot.
 

Beware that if you save directly to a file, you will not see the plot being drawn in R!

SaveDir <- "YOUR_PATH"
library(cowplot)
name <- paste(SaveDir, "/my_amazing_plot.pdf", sep = "")
pdf(name, width = 15, height = 8)  # alternatively use jpeg() or png()
plot_grid(errorbars1, errorbars2, labels = c("A", "B"))
dev.off()

Flow control

for() loops

When you need to do the same operation (e.g. importing a subject’s data) many times, you might want to use a for loop.

For loops belong to the “flow control” commands, and execute according to a counter.

For example, in this for loop we start with the number 1, and then add 1 to it 5 times. We display the number at each iteration

number <-  1
for (counter in 1 : 5) {
  number <- number + 1
  print(number)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6

For loops can also be nested. For example, here we first add 1 to a number, and then print the word “Pizza” from once to “number” times next to each other

number <-  1
for (counter in 1 : 5) {
  number <- number + 1
  print(number)
  for (counter2 in 1 : number){
    print(rep("Pizza", counter2))
  }
}
## [1] 2
## [1] "Pizza"
## [1] "Pizza" "Pizza"
## [1] 3
## [1] "Pizza"
## [1] "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza"
## [1] 4
## [1] "Pizza"
## [1] "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza" "Pizza"
## [1] 5
## [1] "Pizza"
## [1] "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza" "Pizza" "Pizza"
## [1] 6
## [1] "Pizza"
## [1] "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza" "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza" "Pizza" "Pizza" "Pizza"

if… statement

With if statements we can make decisions and act accordingly. For example, we can test whether a condition is met (e.g. a number is odd or even), and only if this is TRUE, another statement gets executed

Here we print “Tiramisu” only if the number is odd, otherwise nothing happens

number <- 5
if( (number %% 2) == 1 ) {
  print("Tiramisu")
}
## [1] "Tiramisu"
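As a reminder, %% is R’s modulo operator (the remainder after division), which is what the odd/even test relies on:

```r
5 %% 2  # 1, so 5 is odd
6 %% 2  # 0, so 6 is even
```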

if…else

Similar to if(), but this time something happens also when the statement is NOT TRUE

e.g. print “Tiramisu” for the number 2, and “Mousse au chocolat” for all other numbers

number <- 5
if (number == 2) {
  print("Tiramisu")
} else {
  print("Mousse au chocolat")
}
## [1] "Mousse au chocolat"

if…else if

Similar to if…else, but more specific.

e.g. print “One” for the number 1, “Two” for the number 2, and “and so forth” for all other numbers

number <- 5
if (number == 1) {
  print("One")
} else if (number == 2) {
  print("Two")
} else {
  print("and so forth")
}
## [1] "and so forth"

ifelse()

A sort of shorthand for the if…else statement.

It tests, for each element of a vector (e.g. a column of a data frame), whether a statement is true, and acts according to your instructions.

Useful for example to change the names of the levels of a factor (e.g. imagine you have an IV “Time” with levels 1, 2, and 3; you change them to t_1, t_2, t_3).

The syntax is ifelse(test_expression, x, y)

# check if these numbers are bigger than five. if yes write you are so big, otherwise write you are very tiny
numbers <- c(3,9,4,8,1,20)
ifelse(numbers > 5, "you are so biiig", "you are very tiny")
## [1] "you are very tiny" "you are so biiig"  "you are very tiny"
## [4] "you are so biiig"  "you are very tiny" "you are so biiig"
# in a data frame, replace small numbers (<5) with "too small", and big numbers (>5) with "too large"
numbers <- as.data.frame( c(3,9,4,8,1,20))
colnames(numbers) <- "number"
numbers$number <- ifelse(numbers$number > 5, "too large", "too small")

ifelse() statements can be nested

# in a data frame, replace "to do R" with "Yuppee", "not to do R" with "Baaah", and "maybe do R" with "there is hope"
data <- as.data.frame( c("to do R", "not to do R", "maybe do R", "to do R", "maybe do R", "to do R"))
colnames(data) <- "motivation"
data$motivation <- ifelse(data$motivation == "to do R", "Yuppee", ifelse(data$motivation == "not to do R", "Baaah", "there is hope"))
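The “Time” example mentioned above can be written the same way (a sketch with made-up data):

```r
# a hypothetical IV "Time" with levels 1, 2, and 3
Time <- c(1, 2, 3, 1, 2, 3)
Time <- ifelse(Time == 1, "t_1", ifelse(Time == 2, "t_2", "t_3"))
Time
## [1] "t_1" "t_2" "t_3" "t_1" "t_2" "t_3"
```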

Create new variable names in loop

Sometimes you want to create a new variable name for each iteration of a for loop.

For example, to save a plot, which will differ for each iteration, in its own variable.

This later allows you to manipulate the plots, arrange them in the same figure, etc.

Use the function assign()

library(ggplot2)
for (iteration in c(4,6, 8)){
  
  # select cars with that many cylinders
  temp <- as.data.frame(  mtcars[mtcars$cyl == iteration, "mpg"] )
  colnames(temp) <- "mpg"
  
  # create plot
  myplot <- ggplot(temp, aes(y=mpg))  + geom_boxplot() + ggtitle(  paste( iteration, "cylinders")   )
  
  # create a variable name and save the plot in it
  var_name <- paste("plot_", iteration, sep="")
  assign(var_name, myplot )
}
cowplot::plot_grid(plot_4, plot_6, plot_8, ncol = 3)

Create a function

Create a command, save it as a function, and execute it when you want.

Create the function “desert”, which takes a number, and writes “Tiramisu” if the input is 2, or “Lasagne” for all other numbers

desert <- function(number){ 
  if (number == 2) {
  print("Tiramisu")
    } else {
  print("Lasagne")
    }
}

Then use the function by feeding it with different numbers and reading the output

desert(1)
## [1] "Lasagne"
desert(2)
## [1] "Tiramisu"

You can also write the function in an .R script, which you save under the same name (desert.R).

Then, in order to use desert(), source it in R, i.e. tell R the directory you saved it in:

# e.g. if I saved the script as "desert.R" in the folder "Z:/Myfolder"
source("Z:/Myfolder/desert.R")
desert(4/2)
## [1] "Tiramisu"

Take input from user

You can wait for the user to write some text or numbers, and then use that in further commands

For example, these 2 lines of code ask you what you want to eat, and then display your text input

text <- readline(prompt="Hello! What do you want to eat today?: ")
print(text)

The input could also go straight into a function.

Here, answer y for yes, and n for no, to the question “Do you like R?” What is the answer you got?

R_question <- function(){
  text <- readline(prompt="Do you like R? (y/n): ")
  if (text == "y"){
    return("Yeah you rule!")
  } else if (text == "n"){
    return("Baaaaah!!")
  }
}

# run the function
R_question()

Of course we could write the R_question() function in a script, save it as R_question.R, tell R where it is (using source()), and then use it in future R sessions too, without the need to recreate the function first.